Evaluation of Web-based Corpora: Effects of Seed Selection and Time Interval
Abstract
Recently, there have been efforts to construct written corpora from the WWW. A promising approach to building Web corpora is to send automated queries to search engines and download the pages they return. This makes it possible to build corpora rapidly and economically, but we cannot control what the resulting corpora contain. Under these circumstances, it is important to verify the general nature of Web corpora. This study investigated the effects of two essential factors on three Japanese corpora that we built: the seed terms used for queries, and the time interval between corpus construction sessions, which measures the stability of query results over time. We evaluated the corpora qualitatively, in terms of domains, genres and typical lexical items. The results show two patterns: 1) both seed selection and time interval affect the distribution of text and lexicon; 2) the effect of seed selection is much stronger. The prominent effect of seed selection suggests that a good understanding of the cause-and-effect relation between seeds and retrieved documents is an important step toward gaining some control over the characteristics of Web corpora, in particular for the construction of general corpora meant to represent a language as a whole.
Publication date: 2006